Abstract



One of the major problems that a tree approach to data analysis often
encounters is instability of tree-structures. Thus, if one wishes to interpret the
data structure by the tree approach, the instability issue must be dealt with.

Examining instability at a node of a tree provides insight into the instability 
of the whole tree, since the same theory of instability applies to all the nodes. 
Thus, this paper deals with the instability issue at a single node of a tree. 

We assume that the data are from a regression model, and examine what factors
in that model affect the instability. Squared-error loss is considered as the
criterion for tree construction (the "ls" criterion in the CART program). The selection
rate of a regressor variable at a node of a tree is used as a measure of
instability. The selection rate mainly depends on (i) the regression coefficients, (ii) the
(conditional) variance-covariance structure of the regressor variables (given a
subset of the regressor variables), (iii) the sample size, and (iv) the noise in the
response variable. We report simulation results that show patterns of instability
for several different settings of regression models.
1. INTRODUCTION AND MOTIVATION

In a typical sequential prediction procedure, we observe explanatory or predictor 
variables, one after another, deciding after each observation whether or not to continue 
adding variables. In selecting the next predictor variable, we usually attempt to maximize 
the expected utility, which involves the total cost of variable observations and the loss from 
the decision. This sequential procedure can be depicted by a directed acyclic graph, called 
a tree. We, however, refer to a tree-structured statistical prediction system as a tree. 
Variables are observed at the nodes of a tree. 

Many of the presently available statistical techniques were designed for small data 
sets having standard structure with all variables of the same type; the underlying assumption 
was that the phenomenon is homogeneous. That is, that the same relationship between 
variables held over all of the measurement space. What makes a data set interesting is not 
only its size but also its complexity, where complexity can include such considerations as 
high dimensionality, a mixture of data types, nonstandard data structure and 
nonhomogeneity; that is, different relationships hold between variables in different parts of 
the measurement space. Tree-structured approaches have been suggested for data sets with 
such forms of complexity. 

Use of trees in regression dates back to the AID (Automatic Interaction Detection)
program developed by Morgan and Sonquist (1963). Then followed the ancestor
classification program THAID, developed by Morgan and Messenger (1973). Breiman,
Friedman, Olshen, and Stone (1984) proposed an algorithm called Classification and 
Regression Trees which is designed to provide a statistical sequential decision aid to its 
users for classification or regression problems. If we are given appropriate data, then we can 
get a guide, in a form of an upside-down tree, to what order to observe the predictor 
variables, when to stop observation, and what decision to make! The computer program that 
is based on this algorithm is referred to as CART. Huang (1989) developed a tree-structured 
method of detecting nonlinearity of a regression model. CART is now one of the most 
popular tree-structured data analysis and pattern recognition programs, and is used by many 
statisticians and AI people. 
By its nature, the tree-structured approach is applicable to a data
set which involves any large number of variables, where the variables can be of any type.
It is also useful when the true regression model is non-linear, since it provides us with a rough
picture of the true model.

One of the advantages of the tree-structured approach is that the tree procedure
output gives easily understood and interpreted information regarding the predictive structure
of the data. The tree procedure output, almost universally, provides an illuminating and
natural way of understanding the structure of the problem (Breiman et al. (1984), p. 58).
However, extensive exploration and careful interpretation are necessary to arrive at sound
conclusions (Einhorn (1972), Doyle (1973), Breiman et al. (1984)).

We will use the words "tree-shape" and "tree-structure" for different meanings. We 
define a tree-shape in terms of nodes and the directed arcs connecting the nodes. We 
define, for a given tree-shape, a tree-structure by assigning the selected predictor variable 
to each node and describing how to split the variable at the node. Figure 1.1 is an example 
of a tree-structure, where observations are made at the circles; decisions or predictions are 
made at the boxes. We use Figure 1.1 as follows. Suppose that the predictor variables are
all binary, taking on the values 0 or 1. First, we observe the predictor variable $X_1$. If $X_1 = 1$,
we stop observing and make a prediction; otherwise, we observe $X_2$. The subsequent actions
follow accordingly. If we delete all the letters and numbers from Figure 1.1, what remains
is a tree-shape. We, however, use the terms "tree" and "tree-structure" in the same
sense.

(Figure 1.1 about here) 

Suppose we have a data set from a statistical model, and a tree is obtained based on 
the data set. With the sample size fixed, we repeat generating a data set from the same 
model and then obtaining a tree based on the data set. If the tree-structures are all the same 
over the repeated process, the tree-structures are said to be perfectly stable; otherwise, 
unstable with a level of instability, as will be discussed later in the paper. 

We consider, for example, the tree in Figure 1.1. We label the nodes of the tree by the
index i. Suppose there are several comparably informative variables at node 2. The variables
appearing at node 3 will change according to the variable selected at node 2. A different variable
at node 2 may change the variables at the subsequent nodes. This phenomenon seems to
erode the interpretability of the data structure by the tree approach. Is it really the case?

Breiman et al. discussed interpretability of the data structure via the tree output in
their Section 5.5. Instability of tree structures is a key issue there, and it certainly deserves
much more investigation, since instability is a crucial obstacle to sound interpretation.
What are the factors that cause instability in trees? How do the factors affect instability?
These issues will be investigated, in this paper, at a node of a tree under the assumption
that the data are from a linear regression model. By seeing which factors in a
regression model cause instability, and how, we can gain better insight into the
true statistical properties behind the data in the midst of instability. Understanding the instability
issue at a node will give us an insight into the issue for a whole tree, since the same theory
applies to all the nodes of a tree.

We consider a regression model

$$ y = \beta_0 + \beta_1 X_1 + \cdots + \beta_r X_r + \epsilon, \qquad (1.1) $$

where $\epsilon$ has the $N(0, \sigma_\epsilon^2)$ distribution, and is independent of $(X_1, \ldots, X_r)$. We suppose we have
a data set of size $n$ from the model (1.1) such that the $j$-th observation is

$$ (x_{j1}, \ldots, x_{jr}, y_j). $$

For a vector or matrix $A$, $A'$ means the transpose of $A$. We let

$$ X' = (x_1, \ldots, x_n), \quad \beta' = (\beta_0, \beta_1, \ldots, \beta_r), \quad \text{and} \quad \epsilon' = (\epsilon_1, \ldots, \epsilon_n), \qquad (1.2) $$

where $x_j' = (1, x_{j1}, x_{j2}, \ldots, x_{jr})$, for $j = 1, 2, \ldots, n$.
Then, for a given data set $(X, y)$, the ls estimate of $\beta$ is given by

$$ \hat\beta = (X'X)^{-1} X'y, \qquad (1.3) $$

under the assumption that $X'X$ is of full rank. It is to be noted that $X_1, X_2, \ldots, X_r$ and $Y$ are
all assumed to be random variables.

The regression model as described above will be assumed throughout the paper. This 
paper consists of 5 sections. In Section 2, we introduce measures useful in dealing with the 
instability problem of the tree approach. The unbiased estimators of the measures 
introduced in Section 2 are derived in Section 3. In Section 4, instability of trees is 
illustrated using the unbiased estimators derived in Section 3. Finally, Section 5 presents 
several comments on the results of this paper. 

2. MEASURES FOR THE TREE-STRUCTURED REGRESSION ANALYSIS

Assume that all the X variables are finitely discrete or categorical. Suppose there is
a data set generated from the model (1.1). Then, we have

$$ V(Y) = \beta' \Sigma_X \beta + \sigma_\epsilon^2, \qquad (2.1) $$

where $\beta' = (\beta_0, \beta_1, \ldots, \beta_r)$, and $\Sigma_X$ is the variance-covariance matrix (VCM) of the
column vector $X$, which is given by $X' = (1, X_1, X_2, \ldots, X_r)$.



Definition 2.1 Let $X_1, \ldots, X_r, Y$ be random variables.

For an integer $s$, $1 \le s < r$, let $\{i_1, i_2, \ldots, i_s\}$ be a subset of
$\{1, 2, \ldots, r\}$ and $X^* = (X_{i_1}, X_{i_2}, \ldots, X_{i_s})'$. Then, for $j \in \{1, 2, \ldots, r\} \setminus \{i_1, i_2, \ldots, i_s\}$, we let

$$ IV_{X_j|X^*=x^*} = V(Y \mid X^* = x^*) - E(V(Y \mid X^* = x^*, X_j) \mid X^* = x^*) = V(E(Y \mid X^* = x^*, X_j) \mid X^* = x^*). \qquad (2.2) $$

We call $IV_{X_j|X^*=x^*}$ the improvement value (IV) by $X_j$ given $X^* = x^*$. If confusion
is not likely, we will write $IV_{X_j|x^*}$ for $IV_{X_j|X^*=x^*}$. In the tree approach, we analyze the
relationship between $Y$ and the set of the X-variables by selecting the X-variables one after
another. At the initial selection, select the X-variable for which

$$ IV_{X_j} = V(E(Y \mid X_j)) \qquad (2.3) $$

is maximized. Let the selected variable be $X_1$. Then, for $X_1 = x_1$, say, repeat the same
process. That is, select the X-variable for which

$$ IV_{X_j|X_1=x_1} = V(E(Y \mid X_j, X_1 = x_1) \mid X_1 = x_1) \qquad (2.4) $$

is maximized. If the improvement value in (2.4) is equal to zero, then we stop the selection
process.

A careful look at the IV gives us an insight into the relationship between the
tree-structure and the regression model. At this point, we need the theorem below.
For notational convenience, we will use $X_0$ for the first element (= 1) of $X$.

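Theorem 2.2

Suppose the following two conditions hold for the regression model (1.1):

(i) $X = (X_0, X_1, \ldots, X_r)'$ is a random vector with a VCM $\Sigma_X$;

(ii) the coefficients $\beta_0, \beta_1, \ldots, \beta_r$ are known.

Then, under the set-up of Definition 2.1, we have

$$ IV_{X_j|X^*=x^*} = \beta' \left[ \Sigma_{X|X^*=x^*} - E(\Sigma_{X|X^*=x^*,X_j} \mid X^* = x^*) \right] \beta, \qquad (2.5) $$

where $\Sigma_{X|X^*=x^*}$ is the VCM of $X$ conditional on $X^* = x^*$.
Proof: The proof is straightforward from the regression model (1.1):

$$ V(Y \mid X^* = x^*) = V(X'\beta + \epsilon \mid X^* = x^*) = V(X'\beta \mid X^* = x^*) + \sigma_\epsilon^2 = \beta' \Sigma_{X|X^*=x^*} \beta + \sigma_\epsilon^2. $$

Similarly, we have

$$ E(V(Y \mid X^* = x^*, X_j) \mid X^* = x^*) = \beta' E(\Sigma_{X|X^*=x^*,X_j} \mid X^* = x^*) \beta + \sigma_\epsilon^2. $$

By Definition 2.1, (2.5) follows. □

If confusion is not likely, we will write $\Sigma_{X|x^*}$ for $\Sigma_{X|X^*=x^*}$. Theorem 2.2 says that
the IV depends upon the regression coefficients and the (conditional) VCM of $X$ (given a
subset of $\{X_1, \ldots, X_r\}$).

The IV for the initial selection of the $X_j$ variable is given by

$$ IV_{X_j} = \beta' \left[ \Sigma_X - E(\Sigma_{X|X_j}) \right] \beta. \qquad (2.6) $$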

The variation among the IVs deserves our attention since it is related to the
instability of trees. The following corollary is immediate from Theorem 2.2, and thus its proof
is omitted.

Corollary 2.3

Under the same set-up as Theorem 2.2, for $j$ and $j'$ ($j \ne j'$), both in
$\{1, 2, \ldots, r\} \setminus \{i_1, \ldots, i_s\}$, we have

$$ IV_{X_j|x^*} - IV_{X_{j'}|x^*} = \beta' \left[ E(\Sigma_{X|x^*,X_{j'}} \mid x^*) - E(\Sigma_{X|x^*,X_j} \mid x^*) \right] \beta. \qquad (2.7) $$

In particular, if $r = 2$ in the regression model (1.1), we have, from (2.7), that

$$ IV_{X_1} - IV_{X_2} = \beta_1^2 E(V(X_1 \mid X_2)) - \beta_2^2 E(V(X_2 \mid X_1)). \qquad (2.8) $$

From equation (2.8), we can at least say that the larger $\beta_1^2$ or $E(V(X_1 \mid X_2))$ is, the higher
the probability that $X_1$ is selected rather than $X_2$.
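
As a small illustration of (2.8), the sketch below evaluates the difference $IV_{X_1} - IV_{X_2}$ from a joint probability table of two discrete regressors. The function name `true_div` and its arguments are hypothetical, chosen only for this example; the two tables used in the usage lines are the independent (0.2, 0.5) setting and the joint distribution of Table 4.5 (a).

```python
import numpy as np

def true_div(joint, beta1, beta2, values1=(0.0, 1.0), values2=(0.0, 1.0)):
    """Evaluate beta1^2 E[V(X1|X2)] - beta2^2 E[V(X2|X1)], as in (2.8).

    joint[a, b] is assumed to hold P(X1 = values1[a], X2 = values2[b]).
    """
    joint = np.asarray(joint, dtype=float)
    v1 = np.asarray(values1, dtype=float)
    v2 = np.asarray(values2, dtype=float)
    p1 = joint.sum(axis=1)              # marginal distribution of X1
    p2 = joint.sum(axis=0)              # marginal distribution of X2

    # E[V(X1 | X2)]: conditional variance of X1 in each X2 column, averaged over X2.
    cond1 = joint / p2                  # column b holds P(X1 = . | X2 = values2[b])
    mean1 = v1 @ cond1
    var1 = (v1 ** 2) @ cond1 - mean1 ** 2
    ev1 = float(var1 @ p2)

    # E[V(X2 | X1)]: same computation with the roles of the variables swapped.
    cond2 = joint / p1[:, None]         # row a holds P(X2 = . | X1 = values1[a])
    mean2 = cond2 @ v2
    var2 = cond2 @ (v2 ** 2) - mean2 ** 2
    ev2 = float(var2 @ p1)

    return beta1 ** 2 * ev1 - beta2 ** 2 * ev2

# Independent regressors with P(X1 = 1) = 0.2 and P(X2 = 1) = 0.5.
independent = np.outer([0.8, 0.2], [0.5, 0.5])
print(true_div(independent, beta1=1.0, beta2=1.0))

# Associated regressors, joint probabilities as in Table 4.5 (a).
associated = np.array([[0.48, 0.32], [0.02, 0.18]])
print(true_div(associated, beta1=1.0, beta2=1.0))
```

Under independence the conditional variances reduce to the marginal variances, so the first call is simply $\beta_1^2 V(X_1) - \beta_2^2 V(X_2)$.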

In this section, we have found an expression for IV under the condition of 
Theorem 2.2. 

3. UNBIASED ESTIMATORS 

In this section, I will derive an unbiased estimator of the IV introduced in Section 2.

Lemma 3.1 Let $W_1, \ldots, W_n$ be random variables, and $A$ an $n \times n$ matrix. If
$W = (W_1, \ldots, W_n)'$, then

$$ E(W'AW) = E(W)' A \, E(W) + \sum_{i,j} a_{ij} \, \mathrm{cov}(W_i, W_j), $$

where $a_{ij}$ is the $(i, j)$th entry of $A$.
Proof: From the equation

$$ W'AW = \sum_{i,j} W_i W_j a_{ij}, $$

the desired result is a straightforward consequence. □
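
The identity in Lemma 3.1 also holds exactly for the empirical distribution of any finite sample, which gives a quick numerical check. The sketch below is only an illustration: the helper name `lemma_3_1_sides` and the randomly generated inputs are assumptions made for this example.

```python
import numpy as np

def lemma_3_1_sides(W, A):
    """Return both sides of Lemma 3.1 for the empirical distribution of the rows of W."""
    # Left side: average of the quadratic forms w'Aw over the sample.
    lhs = np.mean(np.einsum("ti,ij,tj->t", W, A, W))
    # Right side: E(W)'A E(W) + sum_ij a_ij cov(W_i, W_j), using the
    # divide-by-N covariance so that the identity is exact.
    mean = W.mean(axis=0)
    cov = np.cov(W, rowvar=False, bias=True)
    rhs = mean @ A @ mean + np.sum(A * cov)
    return lhs, rhs

rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))  # correlated components
A = rng.normal(size=(4, 4))                               # arbitrary, not symmetric
print(lemma_3_1_sides(W, A))
```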
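Theorem 3.2

Let $B$ be a $(r + 1) \times (r + 1)$ matrix. Then, given the data $(X, y)$ from the
regression model (1.1),

$$ E(\hat\beta' B \hat\beta \mid X) = \beta' B \beta + \sigma_\epsilon^2 \, \mathrm{tr}(B (X'X)^{-1}). \qquad (3.1) $$

Proof: By substituting (1.3) in (3.1), we have

$$ E(\hat\beta' B \hat\beta \mid X) = \beta' X'X (X'X)^{-1} B (X'X)^{-1} X'X \beta + c = \beta' B \beta + c, $$

where

$$ c = \sigma_\epsilon^2 \, \mathrm{tr}(X (X'X)^{-1} B (X'X)^{-1} X') = \sigma_\epsilon^2 \, \mathrm{tr}(B (X'X)^{-1}) $$

by Lemma 3.1. Q.E.D.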
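
The statement of Theorem 3.2 can be checked by simulation for a fixed design matrix. The following is a minimal Monte Carlo sketch, not part of the original derivation; the design, the coefficient values, the choice of $B$, and the function name `check_theorem_3_2` are all assumptions made for illustration.

```python
import numpy as np

def check_theorem_3_2(reps=100_000, seed=0):
    """Compare a simulated E(b'Bb | X) with beta'B beta + sigma^2 tr(B (X'X)^{-1})."""
    rng = np.random.default_rng(seed)
    n = 30
    x1 = np.r_[np.ones(6), np.zeros(24)]      # fixed binary regressor, 20% ones
    x2 = np.tile([0.0, 1.0], 15)              # fixed binary regressor, 50% ones
    X = np.column_stack([np.ones(n), x1, x2])
    beta = np.array([1.0, 2.0, 1.0])
    sigma = 1.0
    B = np.diag([0.0, 1.0, 0.0])              # picks out the square of the X1 coefficient

    xtx_inv = np.linalg.inv(X.T @ X)
    # Each row of y is one simulated response vector for the same design X.
    y = X @ beta + sigma * rng.normal(size=(reps, n))
    # Each row of beta_hat is the transposed ls estimate (X'X)^{-1} X'y.
    beta_hat = y @ X @ xtx_inv
    simulated = np.mean(np.einsum("ti,ij,tj->t", beta_hat, B, beta_hat))
    theory = beta @ B @ beta + sigma ** 2 * np.trace(B @ xtx_inv)
    return simulated, theory

print(check_theorem_3_2())
```

The two returned numbers should agree up to Monte Carlo error, which shrinks as `reps` grows.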

Suppose we have a data set of size $n$ from the model (1.1). Let $I_n$ be the $n \times n$
identity matrix, and $J_n$ the $n \times n$ matrix of 1's. Then we have

$$ \frac{1}{n-1} E\left[ X' \left( I_n - \frac{1}{n} J_n \right) X \right] = \Sigma_X. \qquad (3.2) $$

For the given data set, suppose that $n_{x_j}$ cases have $X_j = x_j$; then the summation of $n_{x_j}$ over
all the possible values $x_j$ of $X_j$ is equal to $n$. Let $X_{(x_j)}$ be the $n_{x_j} \times (r + 1)$ matrix
composed of the rows of $X$ each of whose $(j + 1)$th entries is $x_j$.
In analogy to (3.2), we have

$$ \frac{1}{n_{x_j}-1} E\left[ X_{(x_j)}' \left( I_{n_{x_j}} - \frac{1}{n_{x_j}} J_{n_{x_j}} \right) X_{(x_j)} \right] = \Sigma_{X|X_j=x_j}. $$

We define

$$ t_X = \frac{1}{n-1} X' \left( I_n - \frac{1}{n} J_n \right) X \qquad (3.3) $$

and

$$ t_{X|X_j=x_j} = \frac{1}{n_{x_j}-1} X_{(x_j)}' \left( I_{n_{x_j}} - \frac{1}{n_{x_j}} J_{n_{x_j}} \right) X_{(x_j)}. \qquad (3.4) $$

Then, we can see that, if $X_j$ and the other X-variables are not independent,

$$ \sum_{x_j} \frac{n_{x_j}}{n} \, t_{X|X_j=x_j} = t_{X|X_j}, \text{ say}, \qquad (3.5) $$

is an unbiased estimator of $E(\Sigma_{X|X_j})$; otherwise $t_{X|X_j}$ is given by (3.8) below.
In (3.5), the summation is done over the support set of $X_j$.

We let $i = (i_1, \ldots, i_s)$, for $1 \le s \le r$, and let $D^{(i)}$ be the $(r + 1) \times (r + 1)$
diagonal matrix where

$$ \text{the } (j, j)\text{th entry} = \begin{cases} 0 & \text{if } j \in \{i_1, \ldots, i_s\}, \\ 1 & \text{otherwise.} \end{cases} $$

We let $a = \{1, 2, \ldots, r\} \setminus \{i_1, i_2, \ldots, i_s\}$. If $X^*$ and $X_j$ are independent for $j \in a$, then

$$ \Sigma_{X|X^*=x^*} = D^{(i)} \Sigma_X D^{(i)} \quad \text{for each possible value } x^* \text{ of } X^*. $$

Recall that the matrix $X$ in expression (1.2) is a random matrix. For a given set of
data $(X, y)$, suppose we fit the linear regression model (1.1), and the ls estimator of $\beta$ is
denoted by $\hat\beta$.
Under the normality assumption of $\epsilon$ and the independence assumption of $X_1, \ldots, X_r$,
the mean squared error (MSE) from the least-squares fit of the model (1.1) is the unique
minimum variance unbiased estimator of $\sigma_\epsilon^2$ (Atiqullah (1962)). We will denote the MSE
by $\hat\sigma_\epsilon^2$. We can also find the uniformly minimum variance unbiased (UMVU) estimator
of $\sigma_\epsilon^2$ when $X_1, \ldots, X_r$ are correlated (Theorem 4.1 of Lehmann (1983)). We also denote the
estimator by $\hat\sigma_\epsilon^2$.

The IV value in Definition 2.1 depends on the joint distribution of $X_1, X_2, \ldots, X_r$ and
$Y$. If we base the IV value on the X-matrix of a given data set $(X, y)$, and denote such IV by
$IV^X$, then we may write

$$ IV^X_{X_j|x^*} = \beta' (t_{X|x^*} - t_{X|x^*,X_j}) \beta. \qquad (3.6) $$

Theorem 3.3

Suppose the regression model (1.1) is true. Then, given the data $(X, y)$, the statistic
given below is an unbiased estimator of $IV^X_{X_j|x^*}$ for $j \in a$:

$$ \widehat{IV}^X_{X_j|x^*} = \hat\beta' (t_{X|x^*} - t_{X|x^*,X_j}) \hat\beta - \hat\sigma_\epsilon^2 \, \mathrm{tr}\left[ (t_{X|x^*} - t_{X|x^*,X_j}) (X'X)^{-1} \right]. \qquad (3.7) $$

Proof: Under the normality assumption of $\epsilon$, we can always find the UMVU estimator
$\hat\sigma_\epsilon^2$ of $\sigma_\epsilon^2$. The rest of the theorem follows immediately from Theorem 3.2. □
We suppose that

[IND-1] for $j \in a$, $X^*$, $X_j$, and the vector of the rest of the X-variables are mutually
independent.

Then,

$$ \Sigma_{X|x^*} = D^{(i)} \Sigma_X D^{(i)} \quad \text{and} \quad t_{X|x^*,X_j} = D^{(i,j)} t_X D^{(i,j)}. \qquad (3.8) $$

Thus in the mutual independence situation, we have, from (2.5),

$$ IV_{X_j|x^*} = \beta' D^{(i)} \Sigma_X D^{(i)} \beta - \beta' D^{(i,j)} \Sigma_X D^{(i,j)} \beta. $$

$IV^X_{X_j|x^*}$ can thus be written as follows:

$$ IV^X_{X_j|x^*} = \beta' D^{(i)} t_X D^{(i)} \beta - \beta' D^{(i,j)} t_X D^{(i,j)} \beta. \qquad (3.9) $$

Consequently, we have the following result.

Theorem 3.4

Suppose the regression model (1.1) is true. Then, under the independence condition
[IND-1] and given the data $(X, y)$ from the model (1.1), the statistic given below is an
unbiased estimator of $IV^X_{X_j|x^*}$ in expression (3.9):

$$ \widehat{IV}^X_{X_j|x^*} = \hat\beta' \left( D^{(i)} t_X D^{(i)} - D^{(i,j)} t_X D^{(i,j)} \right) \hat\beta - K, \qquad (3.10) $$

where

$$ K = \hat\sigma_\epsilon^2 \left[ S_{jj} (X'X)^{-1}_{jj} + 2 \sum_{k \in a_j} S_{jk} (X'X)^{-1}_{jk} \right], \qquad (3.11) $$

$S_{kl}$ is the sample covariance of $X_k$ and $X_l$, $A_{kl}$ is the $(k, l)$th element of a matrix $A$, and $a_j = a \setminus \{j\}$.



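Proof: The proof is sufficient if we show equation (3.11). By (3.3),

$$ \mathrm{tr}\left[ D^{(i)} t_X D^{(i)} (X'X)^{-1} \right] = \frac{1}{n-1} \left\{ \mathrm{tr}\left[ D^{(i)} X'X D^{(i)} (X'X)^{-1} \right] - \frac{1}{n} \mathrm{tr}\left[ D^{(i)} X' J_n X D^{(i)} (X'X)^{-1} \right] \right\}. \qquad (3.12) $$

For the first term in (3.12):

$$ \mathrm{tr}\left[ D^{(i)} X'X D^{(i)} (X'X)^{-1} \right] = \mathrm{tr}\left[ (D^{(i)} X'X D^{(i)}) (D^{(i)} (X'X)^{-1} D^{(i)}) \right]. \qquad (3.13) $$

For the second term in (3.12):

$$ \mathrm{tr}\left[ D^{(i)} X' J_n X D^{(i)} (X'X)^{-1} \right] = \mathrm{tr}\left[ (D^{(i)} X' J_n X D^{(i)}) (D^{(i)} (X'X)^{-1} D^{(i)}) \right] = \sum_{k \in a} \sum_{l \in a} (X' J_n X)_{kl} (X'X)^{-1}_{kl}. \qquad (3.14) $$

After simple algebra, we have

$$ X' J_n X = n^2 M, \qquad (3.15) $$

where $M$ is the $(r + 1) \times (r + 1)$ matrix, with its $(i + 1, j + 1)$th entry being

$$ \sum_k x_{ki} \sum_l x_{lj} / n^2. $$

From (3.14) and (3.15), we have

$$ \mathrm{tr}\left[ D^{(i)} X' J_n X D^{(i)} (X'X)^{-1} \right] = n^2 \sum_{k \in a} \sum_{l \in a} M_{kl} (X'X)^{-1}_{kl}. \qquad (3.16) $$

By (3.12), (3.13), and (3.16), we have

$$ (n-1) \, \mathrm{tr}\left[ D^{(i)} t_X D^{(i)} (X'X)^{-1} \right] = \sum_{k \in a} \sum_{l \in a} \left[ (X'X)_{kl} - n M_{kl} \right] (X'X)^{-1}_{kl} = (n-1) \sum_{k \in a} \sum_{l \in a} S_{kl} (X'X)^{-1}_{kl}. \qquad (3.17) $$

By the same argument, we have

$$ (n-1) \, \mathrm{tr}\left[ D^{(i,j)} t_X D^{(i,j)} (X'X)^{-1} \right] = (n-1) \sum_{k \in a_j} \sum_{l \in a_j} S_{kl} (X'X)^{-1}_{kl}. \qquad (3.18) $$

From (3.17) and (3.18) follows

$$ \mathrm{tr}\left[ D^{(i)} t_X D^{(i)} (X'X)^{-1} \right] - \mathrm{tr}\left[ D^{(i,j)} t_X D^{(i,j)} (X'X)^{-1} \right] = S_{jj} (X'X)^{-1}_{jj} + 2 \sum_{k \in a_j} S_{jk} (X'X)^{-1}_{jk}. $$

Therefore, by Theorem 3.3 and expression (3.8), we get the desired result. □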



It is noteworthy that the unbiased estimator of $IV^X_{X_j|x^*}$ in Theorem 3.3 depends on
the ls estimator of $\beta$ based on the whole data $(X, y)$ rather than based on any subset of
the data $(X, y)$ corresponding to the outcome $X^* = x^*$. Meanwhile, the estimator depends
on the conditional covariance structure of the X-variables.
Under the independence assumption of $X_1, X_2, \ldots, X_r$, the bias term $K$ is non-negative
for large $n$. Thus ignoring the bias term results in overestimation. Actually, we may take $S_{kl}
= 0$, $k \ne l$, for large $n$. Then, from (3.11),

$$ K = \hat\sigma_\epsilon^2 S_{jj} (X'X)^{-1}_{jj} \ge 0, $$

since the independence assumption implies that $(X'X)^{-1}$ is positive definite, and so the
diagonal elements are all positive.

4. ILLUSTRATIONS 

In this section, we present some simple examples of instability for several of its causal
factors discussed in Section 3. Tree-structures may be unstable partially due
to chance fluctuations in the data or due to associations between the variables (see
Subsections 5.5.2 and 8.10.1 of Breiman et al. (1984)). The last paragraph of Subsection
8.10.1 may have to be read with discretion. For the regression model used in their Section 
8.6, the regression coefficients are all different by some amount, while the variances of the 
X-variables are all within a small range. In this situation, the tree-structure may be very 
stable, as will be shown in Example 4.1. 

Example 4.1 

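Consider a regression model (1.1) with $r = 2$, and suppose that the X-variables are
independent. If there are no X-variables already known, i.e., $\{i_1, i_2, \ldots, i_s\} = \emptyset$, then, from
(3.10), we have

$$ \widehat{IV}^X_{X_j} = \hat\beta' (t_X - D^{(j)} t_X D^{(j)}) \hat\beta - \hat\sigma_\epsilon^2 S_{jj} (X'X)^{-1}_{jj}. \qquad (4.1) $$

Since there are only two X-variables, we may look at

$$ DIV_{12} = \widehat{IV}^X_{X_1} - \widehat{IV}^X_{X_2} $$

to see which X-variable is actually selected based on a given data set. From (4.1) follows

$$ DIV_{12} = \hat\beta_1^2 S_{11} - \hat\beta_2^2 S_{22} + \hat\sigma_\epsilon^2 \left[ S_{22} (X'X)^{-1}_{22} - S_{11} (X'X)^{-1}_{11} \right]. \qquad (4.2) $$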
For simulation, we consider a version of the regression model (1.1) (call it M-1)
under the following conditions:

(a) $\beta_0 = \beta_1 = \beta_2 = 1$,

(b) $\sigma_\epsilon^2 = 1$,

(c) $P(X_1 = 1) = 0.2$, $P(X_1 = 2) = 0.8$, $P(X_2 = 1) = P(X_2 = 2) = 0.5$.

If the DIV-value is positive, then we select the $X_1$ variable; if negative, $X_2$ is selected.
If the DIV is equal to zero, then both variables are equally likely. This selection rule is the
same as CART's with the "least-square" selection criterion. Table 4.1 is
obtained based on 10 data sets of size 100 each from the model M-1. Each row corresponds
to one data set.

(Table 4.1 about here) 
As indicated in Table 4.1, there is some uncertainty in variable-selection. To get some idea
of the uncertainty, we generated 500 data sets of size 30 each. Figure 4.1 is the histogram of
the 500 DIV-values.

(Figure 4.1 about here) 
Table 4.2 shows the selection rates of the $X_1$ variable out of 1,000 iterations for each
specified regression model. For the table, we allowed 1, 2, and 3 for $\beta_1$; (0.2, 0.3), (0.2, 0.5),
(0.3, 0.4), and (0.3, 0.5) for $(P(X_1 = 1), P(X_2 = 1))$; 5, 10, 30, and 50 for the sample size.
The values in the row of E(DIV) (call it the "true DIV") are obtained from $IV_{X_1} - IV_{X_2}$.

(Table 4.2 about here) 

(Figure 4.2 about here) 
Table 4.2 is graphed in Figure 4.2, where the numbers on the right margin or on the 
lines are the true DIV values. From the graph we can see that the selection rate depends 
on the true DIV and the sample size. When the true DIV is larger than or equal to 0.39, 
the selection rate is not less than 0.75 even at the sample size 10. On the other hand, for 
the true DIVs between -0.09 and -0.03, the selection rate is not less than 0.3 even for the 
sample size 50. 
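
The dependence of the selection rate on the sample size and the true DIV can be explored with a small simulation. The sketch below is an assumption-laden illustration rather than the program used for Table 4.2: it codes the regressors as 0/1, uses the plain least-squares split criterion (reduction in the sum of squared errors at the root) instead of the bias-corrected statistic (4.2), counts an exact tie as half a win, and the function names are hypothetical.

```python
import numpy as np

def sse_reduction(x, y):
    """Reduction in the sum of squared errors when y is split on a binary x."""
    n1 = int(x.sum())
    n0 = x.size - n1
    if n0 == 0 or n1 == 0:          # no split is possible on a constant column
        return 0.0
    diff = y[x == 1].mean() - y[x == 0].mean()
    return n0 * n1 / x.size * diff ** 2

def selection_rate_x1(beta1, p1, p2, n, sigma=1.0, reps=1000, seed=0):
    """Fraction of simulated data sets in which X1 wins the root split over X2."""
    rng = np.random.default_rng(seed)
    wins = 0.0
    for _ in range(reps):
        x1 = (rng.random(n) < p1).astype(float)
        x2 = (rng.random(n) < p2).astype(float)
        # beta_0 = beta_2 = 1 as in model M-1; only beta_1 is varied.
        y = 1.0 + beta1 * x1 + 1.0 * x2 + sigma * rng.normal(size=n)
        g1, g2 = sse_reduction(x1, y), sse_reduction(x2, y)
        if g1 > g2:
            wins += 1.0
        elif g1 == g2:
            wins += 0.5             # a tie is treated as a fair coin flip on average
    return wins / reps

for n in (5, 10, 30, 50):
    print(n, selection_rate_x1(beta1=2.0, p1=0.2, p2=0.5, n=n))
```

Because the criterion and the random numbers differ from those behind Table 4.2, the printed rates should be read only as showing the same qualitative pattern.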

(Table 4.3 about here) 
Table 4.3 shows the relationship between the selection rate of $X_1$ and the sample
size for the regression model M-1. According to the table, we need a sample of size larger
than 300 to reach a selection rate of $X_1$ of 0.1, and 600 to reach a selection rate of 0.05. This
is an extreme situation compared with the case where $\beta_1 = 2$ or 3 in Table 4.2. □

We may safely conclude from Example 4.1 that the absolute distance between the 
selection rate and 0.5 increases 

(i) as the absolute value of the true DIV increases for each sample size, or 

(ii) as the sample size increases for each true DIV. 

We define the level of instability or the instability level (at a node) to be equal to 0.5 minus
the above-mentioned absolute distance. Thus the instability level is between 0 and 0.5
inclusive. 0 means "the lowest instability (i.e., perfect stability)", and 0.5 "the highest
instability".
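
The definition above translates directly into a one-line computation. The function name `instability_level` is hypothetical; the example rates are taken from the (0.2, 0.5) column of Table 4.2.

```python
def instability_level(selection_rate):
    """Instability level at a node: 0.5 minus the distance of the rate from 0.5."""
    if not 0.0 <= selection_rate <= 1.0:
        raise ValueError("a selection rate must lie between 0 and 1")
    return 0.5 - abs(selection_rate - 0.5)

# Selection rates of X1 at sample size 50 for beta_1 = 1, 2, 3 (Table 4.2).
for rate in (0.32, 0.91, 1.00):
    print(rate, instability_level(rate))
```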

As implied by expression (3.7), the level of instability depends on the sample size,
the association level among the X-variables, $\sigma_\epsilon^2$ and $\beta$. If we knew $IV_{X_j|x^*}$ for any subset
$\{X_{i_1}, \ldots, X_{i_s}\}$ of $\{X_1, \ldots, X_r\}$ and $j \in a = \{1, 2, \ldots, r\} \setminus \{i_1, i_2, \ldots, i_s\}$, then from the tree
which is obtained based on $IV_{X_j|x^*}$ we could see which X-variable partitions the population
so that the partitioned subgroups are most homogeneous with respect to $Y$, i.e., the
within-group variances of $Y$ are minimized; and so on, for all the subsequent nodes. That is,
conditional on a set of X-variables already observed at the previous nodes, we select
the X-variable which divides the current subset of the population into the most homogeneous
subgroups. If we say that this tree, $T_\infty$, is an unknown parameter, then we may say that the tree
$T_{X,y}$ which is obtained based on the data $(X, y)$ is an estimate of the parameter. As indicated
in Example 4.1, we can expect that the tree $T_{X,y}$ will approach $T_\infty$ as the sample size
increases.
However, $T_\infty$ may not be an interesting object, because $T_\infty$ does not necessarily show
the whole picture of the corresponding statistical model. This is analogous to the fact that the
scatterplots of all the pairs of $Y$ and X variables do not reveal the joint structure of the data.
In some sense, instability of trees can be a signal to data analysts that further investigation 
is desirable on data. 

The example below is continued from Example 4.1, and illustrates how the noise in 
the regression model affects instability of trees. 

Example 4.2 

Consider a regression model (1.1) with $r = 2$, which satisfies condition (c) for model
(M-1) of Example 4.1, and $\beta_0 = \beta_2 = 1$. In this example, we allow 1, 2, and 3 for $\beta_1$, and
2, 3, and 4 for $\sigma_\epsilon$, and see how the selection rate of $X_1$ changes. The selection rates for
$\sigma_\epsilon = 1$ are in Table 4.2. Table 4.4 is obtained by the same method as for Table 4.2 (the
number of repetitions = 1,000). From (3.6), we can see that the true DIV has nothing to do
with the noise ($\sigma_\epsilon$). Expression (4.2) says that the noise affects the tree-instability through
the bias term.

Table 4.4 says that instability of trees becomes serious as the noise ($\sigma_\epsilon$) in the
response variable increases. From Table 4.4 and the fourth column of Table 4.2, we can see
that the selection rate of $X_1$ gets closer to 0.5 as the noise increases for each sample size.
Expression (3.7) explains this phenomenon. But, since $(X'X)^{-1}$ converges in the order of
$O(1/n)$, the instability due to the noise ($\sigma_\epsilon$) can be overcome only by increasing the
sample size.

(Table 4.4 about here) 

□ 

Next, we will consider a case where the X-variables are associated.

Example 4.3

Consider two versions of the regression model (1.1) with $r = 2$, where the X's are
binary (0 or 1), and the two models differ in the joint probability of the X's. Their joint
probabilities are given by Table 4.5 (a) and (b), respectively. We call the model
corresponding to Table 4.5 (a) model (M-2a), and the one corresponding to Table 4.5 (b) model (M-2b).

(Table 4.5 about here)

We put

(a) $\beta_0 = \beta_2 = 1$, and

(b) $\sigma_\epsilon^2 = 1$.

The marginals of $X_1$ and $X_2$ for both models are as in (c) of model (M-1) of Example 4.1.
We allow 1, 2, and 3 for $\beta_1$ in the simulation.

Table 4.6 shows the selection rates of the $X_1$ variable out of 1,000 iterations for each
specified regression model (changing values for $\beta_1$). The association between $X_1$ and $X_2$ (the
correlation coefficients of the X variables are 0.4 for model (M-2a) and 0.25 for model
(M-2b)) shrank the true DIV values towards 0 a little, leading to a higher level of
instability (compare Table 4.6 with the fourth column of Table 4.2). The fact that $X_1$ and
$X_2$ in model (M-2a) are correlated more strongly than those in model (M-2b) is reflected
in the true DIVs, and in turn in the selection rates. Table 4.6 suggests that, provided that
the marginals of $X_1$ and $X_2$ are fixed, the level of instability is higher for a larger absolute
value of the correlation coefficient.

(Table 4.6 about here) 

□ 

Finally, a simple example follows where we will see how the variation in X can 
contribute to instability of trees. 



Example 4.4 

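Consider a simple regression model with $r = 1$, where the X variable is binary (0 or
1) with $P(X_1 = 1) = p$. Let the data size be equal to $n$. Then,

$$ X'X = \begin{pmatrix} n & s \\ s & s \end{pmatrix}, $$

where $s$ is the number of the cases with $X_1 = 1$ in the data set.
Now,

$$ (X'X)^{-1} = (ns - s^2)^{-1} \begin{pmatrix} s & -s \\ -s & n \end{pmatrix}, $$

yielding

$$ V(\hat\beta_1) = \frac{\sigma_\epsilon^2}{n} E\left[ \left( \frac{s}{n} \left( 1 - \frac{s}{n} \right) \right)^{-1} \right]. \qquad (4.3) $$

By Jensen's inequality,

$$ V(\hat\beta_1) \ge \frac{\sigma_\epsilon^2}{n} \left\{ E\left[ \frac{s}{n} \left( 1 - \frac{s}{n} \right) \right] \right\}^{-1}. \qquad (4.4) $$

From (4.3), we can say that $V(\hat\beta_1)$ increases as $p$ approaches 0 or 1 for a given $\sigma_\epsilon^2$. The
inequality in (4.4) provides us with the greatest lower bound of $V(\hat\beta_1)$ for the given
distribution of the X variable. □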
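
The expectation in (4.3) and the bound in (4.4) can be evaluated exactly when $s$ is binomial. The sketch below is an illustration under one added assumption: both expectations are taken conditional on $0 < s < n$, so that $X'X$ is of full rank (the text assumes full rank but does not spell out this conditioning). The function names are hypothetical, and the returned values are to be multiplied by $\sigma_\epsilon^2$.

```python
from math import comb

def _binomial_interior(n, p):
    """Binomial(n, p) probabilities for s = 1, ..., n - 1, renormalised to sum to 1."""
    probs = {s: comb(n, s) * p ** s * (1 - p) ** (n - s) for s in range(1, n)}
    total = sum(probs.values())
    return {s: pr / total for s, pr in probs.items()}

def slope_variance_factor(n, p):
    """E[n / (s (n - s)) | 0 < s < n], i.e. the right side of (4.3) with sigma^2 = 1."""
    probs = _binomial_interior(n, p)
    return sum(pr * n / (s * (n - s)) for s, pr in probs.items())

def jensen_bound(n, p):
    """Right side of (4.4) with sigma^2 = 1, under the same conditioning."""
    probs = _binomial_interior(n, p)
    mean = sum(pr * (s / n) * (1 - s / n) for s, pr in probs.items())
    return 1.0 / (n * mean)

for p in (0.5, 0.3, 0.1):
    print(p, slope_variance_factor(30, p), jensen_bound(30, p))
```

The printed factors grow as $p$ moves away from 0.5, which is the pattern described for (4.3).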

In this section, our purpose was to see some patterns of instability, and we
considered some simple regression models. Regression models with larger r would
complicate our problem with only a little more gain, since the variable selection is
essentially by pairwise comparisons of the IVs.

It is to be noted at this point that the instability discussed in this paper is confined
to a node of a tree, not over a whole tree. However, to understand the instability of trees,
we need to understand the instability at each node.

In this section, we have seen, for a regression model with $r = 2$,

(1) that the instability level increases as the absolute value of the true DIV decreases,

(2) that the instability due to the noise in the regression model can be cured only by
increasing the sample size,

(3) that when the absolute value of the true DIV is small (less than 0.1), increasing the
sample size will be of little help; on the other hand, when the absolute value is not
less than 0.4, the instability level looks good (the selection rate of $X_1$ is over 0.8) for
a sample size around 30, and

(4) that if we compare the instability levels from any two regression models, both of
which are the same except that the X's are independent in one model, and not in
the other, then the instability level may be lower for the independent case than for
the other case.

Relationship between the DIV or IV and the instability level at each node seems 
to deserve further study. 

5. DISCUSSION. 

At the outset, the consideration of the tree approach for a data set obtained from 
a linear regression model may sound like nonsense. However, if all the regressor variables 
involved are finitely discrete, then fitting a regression model is equivalent to partitioning the 
sample space generated by the regressor variables involved in the model fitting. If the same 
set of regressor variables that are involved in the model fit is used in the tree approach, 
then the derived tree, in general, gives rise to a partition of the sample space coarser than 
the one corresponding to the regression approach. This is an advantage of the tree 
approach over the classical regression approach as far as the prediction accuracies are of 
an equivalent level. 

Many criteria have been developed for choosing the best regression models (Seber (1977),
Miller (1990)). Among them are the coefficient of determination (R-square), Mallows' Cp,
and MSEP. Any of these seems hardly applicable to selection of the final tree. If we have
a careful look at the expressions (2.5), (3.6), and (3.7), we can see that the tree is
determined by the ls estimate of $\beta$ and the relation among the X-variables. In regression,
the estimate of $\beta$ changes for different sets of regressors; while, in the tree approach, we
use the same estimate of $\beta$ all through the tree-construction process.

Instability of the tree structure is certainly a drawback in the tree approach, but it
also is a signal for further investigation for a sound interpretation of the stochastic
properties behind data. Based on the theoretical results and the examples of this paper, I
can safely say the following:

(1) If instability is seen near the bottom of a tree, it may be due to the pure noise in 
data. Increasing the sample size may help. 

(2) If instability is elsewhere, it may be due to association among the regressor or 
predictor variables. In this case comparing different tree-structures may help for a 
better insight into the nature behind data. 

Instability at a node near the top would affect the whole tree-structure and tree-shape.
If the IVs of a set of regressors are more or less at the same level, the instability level may
decrease at a very slow rate (see, for example, Table 4.3). Thus even for large data sets,
it is not very surprising to see instability. In such a situation, those trees that show up at
comparable frequencies (suppose we repeatedly draw random subsamples from a data set generated
from a statistical model and construct a tree from each subsample)
may deserve equal attention for a sound interpretation of the stochastic properties behind
data, since those competing regressor variables may be equally informative for the predicted
or dependent variable. In this context, a computer program that can construct a tree where
a particular regressor variable is split at a user-specified node of the tree is desirable. With
this program, we can construct several trees from a data set, and use them for better
interpretation of the stochastic properties behind the data.



References



Atiqullah, M. (1962). The estimation of residual variance in quadratically balanced least
    squares problems and the robustness of the F-test. Biometrika, 49, 83-91.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and
    Regression Trees. Wadsworth International Group, Belmont, CA.
Doyle, R. M. (1973). The use of automatic interaction detector and similar search
    procedures. Operational Res. Quart., 24, 465-467.
Einhorn, H. (1972). Alchemy in the behavioral sciences. Pub. Op. Quart., 36, 367-378.
Huang, M. C. (1989). Piecewise linear tree-structured regression. Unpublished Ph.D. thesis,
    Dept. of Statistics, University of Wisconsin-Madison.
Lehmann, E. L. (1983). Theory of Point Estimation. John Wiley & Sons, Inc.
Miller, A. J. (1990). Subset Selection in Regression. Chapman and Hall, London, England.
Morgan, J. N. and Messenger, R. C. (1973). THAID: A sequential search program for the
    analysis of nominal scale dependent variables. Ann Arbor: University of Michigan,
    Institute for Social Research.
Morgan, J. N. and Sonquist, J. A. (1963). Problems in the analysis of survey data, and a
    proposal. J. A. S. A., 58, 415-434.
Seber, G. A. F. (1977). Linear Regression Analysis. John Wiley & Sons, Inc.




Table 4.1

DIV       Variable-selection by CART
 0.016    $X_1$
-0.3
 0.056
Table 4.2

                                   $(P(X_1 = 1), P(X_2 = 1))$
$\beta_1$   Sample Size   (0.2, 0.3)   (0.2, 0.5)   (0.3, 0.4)   (0.3, 0.5)
1           5              0.52         0.50         0.50         0.50
            10             0.49         0.43         0.46         0.48
            30             0.41         0.37         0.45         0.43
            50             0.38         0.32         0.42         0.44
            E(DIV)        -0.05        -0.09        -0.03        -0.04
2           5              0.74         0.71         0.74         0.74
            10             0.78         0.75         0.80         0.81
            30             0.88         0.85         0.93         0.94
            50             0.93         0.91         0.97         0.97
            E(DIV)         0.43         0.39         0.6          0.59
3           5              0.89         0.90         0.91         0.89
            10             0.93         0.92         0.95         0.96
            30             0.98         0.98         1.00         1.00
            50             1.00         1.00         1.00         1.00
            E(DIV)         1.19         1.15         1.66         1.65
Table 4.3

Sample Size   Selection Rate of $X_1$
5             0.5
10            0.43
30            0.37
50            0.32
75            0.26
100           0.23
150           0.2
200           0.17
300           0.12
400           0.09
500           0.07
600           0.043
700           0.03
800           0.023
900           0.022
Table 4.4

                                        $\sigma_\epsilon$
                Sample Size   2        3        4
$\beta_1 = 1$   5             0.497    0.49     0.524
                10            0.466    0.469    0.51
                30            0.435    0.438    0.451
                50            0.381    0.456    0.418
$\beta_1 = 2$   5             0.584    0.539    0.548
                10            0.622    0.614    0.542
                30            0.695    0.648    0.646
                50            0.781    0.69     0.636
$\beta_1 = 3$   5             0.716    0.623    0.584
                10            0.764    0.681    0.665
                30            0.887    0.823    0.763
                50            0.952    0.867    0.798
Table 4.5

(a)
                 $X_2$
                 0       1
$X_1$    0       0.48    0.32
         1       0.02    0.18

(b)
                 $X_2$
                 0       1
$X_1$    0       0.45    0.35
         1       0.05    0.15
Table 4.6

                            Selection Rate
$\beta_1$   Sample Size   M-2a      M-2b
1           5              0.48      0.51
            10             0.475     0.47
            30             0.4       0.35
            50             0.36      0.34
            E(DIV)        -0.076    -0.084
2           5              0.66      0.7
            10             0.695     0.697
            30             0.79      0.805
            50             0.87      0.89
            E(DIV)         0.33      0.366
3           5              0.82      0.87
            10             0.87      0.89
            30             0.97      0.975
            50             0.99      0.996
            E(DIV)         1         1.12